finalize tutorial 3 for release
whoisjones committed Oct 25, 2023
1 parent 184582b commit 511b47a
Showing 1 changed file with 23 additions and 90 deletions.
113 changes: 23 additions & 90 deletions tutorials/TUTORIAL-3_ADVANCED-GENERATION.md
@@ -43,7 +43,7 @@ Movie Review: {text}
Sentiment:
```

## Inferring the Prompt from Dataset Info

Huggingface Dataset objects make it possible to infer a prompt directly from the dataset. This can be achieved with
the `infer_prompt_from_dataset` function. This function takes a dataset
@@ -142,15 +142,12 @@ question: What term characterizes the intersection of the rites with the Roman C
answers: **{'text': ['full union'], 'answer_start': [104]}**
```

To overcome this, we provide a range of preprocessing functions for various downstream tasks.

### Text Classification

The `convert_label_ids_to_texts` function transforms your text classification dataset with label IDs into textual
labels. By default, it uses the label names specified in the dataset's features column.
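
A minimal sketch of how this conversion might be invoked; the exact import path, signature, and return value are
assumptions modeled on the analogous `convert_token_labels_to_spans` call shown later in this tutorial and may
differ from the released API:

```python
from datasets import load_dataset
from fabricator.dataset_transformations import convert_label_ids_to_texts  # import path assumed

dataset = load_dataset("imdb", split="train")

# Assumed usage: pass the dataset and the name of the label column; the function is
# assumed to return the converted dataset together with the textual label options.
dataset, label_options = convert_label_ids_to_texts(dataset, "label")

print(dataset[0]["label"])  # a label name from the dataset's features instead of an integer ID
print(label_options)        # the textual label options, e.g. for building a custom prompt
```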

```python
from datasets import load_dataset
@@ -207,9 +204,7 @@ text: {text}
label:
```

Once the dataset is generated, one can easily convert the string labels back to label IDs by
using huggingface's `class_encode_column` function.

```python
@@ -221,23 +216,27 @@ print("Features: " + str(dataset.features["label"]))
Which yields:

```text
Labels: [0, 1, 1, 0, 0]
Features: ClassLabel(names=['negative', 'positive'], id=None)
```

<ins>Note:</ins> While generating the dataset, the model is supposed to assign labels based on the specific options
provided. However, we do not filter the data if it doesn't adhere to these predefined labels.
Therefore, it's important to double-check if the annotations match the expected label options.
If they don't, you should make corrections accordingly.
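
One minimal way to perform that check, filtering out rows whose label is not among the expected options, could look
like this (plain `datasets` operations; the column name, label options, and example rows are illustrative):

```python
from datasets import Dataset

# Illustrative stand-in for a generated dataset; the last label does not match any option.
generated_dataset = Dataset.from_dict({
    "text": ["Great movie!", "Terrible plot.", "Loved every minute.", "A film."],
    "label": ["positive", "negative", "positive", "somewhat positive"],
})

label_options = ["negative", "positive"]

# Keep only rows whose generated label is one of the expected options.
generated_dataset = generated_dataset.filter(lambda row: row["label"] in label_options)

# Convert the remaining string labels back to label IDs.
generated_dataset = generated_dataset.class_encode_column("label")
print(generated_dataset.features["label"])  # ClassLabel(names=['negative', 'positive'], ...)
```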

### Question Answering (Extractive)

In question answering tasks, we offer two functions to handle dataset processing: preprocessing and postprocessing.
The preprocessing function transforms datasets from SQuAD format into flat strings. The postprocessing function
reverses this process by converting flat predictions back into SQuAD format: it determines the answer's start
position and logs a warning if the answer cannot be found in the given context or occurs multiple times.
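
To make concrete what the postprocessing step has to compute, here is a small, self-contained illustration in plain
Python of recovering the answer start from a flat answer string and flagging answers that are missing from the
context or occur multiple times; it is a sketch of the idea, not fabricator's actual implementation:

```python
import logging

def to_squad_answer(context: str, answer: str) -> dict:
    """Rebuild a SQuAD-style answers dict from a flat answer string."""
    start = context.find(answer)
    if start == -1:
        logging.warning("Answer %r not found in context.", answer)
        return {"text": [], "answer_start": []}
    if context.count(answer) > 1:
        logging.warning("Answer %r occurs multiple times; using the first match.", answer)
    return {"text": [answer], "answer_start": [start]}

context = "Their rites are in full union with the Roman Catholic Church."
print(to_squad_answer(context, "full union"))
# {'text': ['full union'], 'answer_start': [19]}
```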

```python
from datasets import load_dataset
from fabricator.prompts import infer_prompt_from_dataset
from fabricator.dataset_transformations.question_answering import preprocess_squad_format, postprocess_squad_format

dataset = load_dataset("squad_v2", split="train")
prompt = infer_prompt_from_dataset(dataset)
@@ -268,8 +267,8 @@ answers:

### Named Entity Recognition

If you attempt to create a dataset for named entity recognition without any preprocessing, the prompt might be
challenging for the language model to understand.

```python
from datasets import load_dataset
@@ -298,8 +297,9 @@ tokens: {tokens}
ner_tags:
```

To enhance prompt clarity, we can preprocess the dataset by converting labels into spans. This conversion can be
accomplished using the `convert_token_labels_to_spans` function. Additionally, the function will provide the
available label options:

```python
from datasets import load_dataset
@@ -329,7 +329,7 @@ tokens: {tokens}
ner_tags:
```
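
To illustrate conceptually what converting token labels to spans involves, here is a small BIO-to-span sketch in
plain Python; fabricator's `convert_token_labels_to_spans` may differ in its exact output format:

```python
def bio_to_spans(tokens, tags):
    """Collapse BIO token labels into (span_text, label) pairs."""
    spans, current_tokens, current_label = [], [], None
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != current_label)
        if (tag == "O" or starts_new) and current_tokens:
            spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
        if tag != "O":
            current_tokens.append(token)
            current_label = tag[2:]
    if current_tokens:
        spans.append((" ".join(current_tokens), current_label))
    return spans

tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
tags = ["B-ORG", "O", "B-MISC", "O", "O", "O", "B-MISC", "O", "O"]
print(bio_to_spans(tokens, tags))
# [('EU', 'ORG'), ('German', 'MISC'), ('British', 'MISC')]
```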

As in text classification, we can also specify more semantically precise labels with the `expanded_label_mapping`:

```python
expanded_label_mapping = {
@@ -535,70 +535,3 @@ generated_dataset = generated_dataset.class_encode_column("label")

generated_dataset.push_to_hub("your-first-generated-dataset")
```

### Token Classification

Note: Token classification is currently still in development and often yields unstable results. In particular, the
conversion between spans and token labels welcomes contributions for a more stable implementation.

```python
import os
from datasets import load_dataset
from haystack.nodes import PromptNode
from fabricator import DatasetGenerator
from fabricator.prompts import BasePrompt
from fabricator.dataset_transformations import convert_token_labels_to_spans, convert_spans_to_token_labels

dataset = load_dataset("conll2003", split="train")
expanded_label_mapping = {
    0: "O",
    1: "B-person",
    2: "I-person",
    3: "B-location",
    4: "I-location",
    5: "B-organization",
    6: "I-organization",
    7: "B-miscellaneous",
    8: "I-miscellaneous",
}
dataset, label_options = convert_token_labels_to_spans(dataset, "tokens", "ner_tags", expanded_label_mapping)

fewshot_dataset = dataset.select(range(10))
unlabeled_dataset = dataset.select(range(10, 20))

prompt = BasePrompt(
    task_description="Annotate each token with its named entity label: {}.",
    generate_data_for_column="ner_tags",
    fewshot_example_columns="tokens",
    label_options=list(expanded_label_mapping.values()),
)

prompt_node = PromptNode(
    model_name_or_path="gpt-3.5-turbo",
    api_key=os.environ.get("OPENAI_API_KEY"),
    max_length=100,
)

generator = DatasetGenerator(prompt_node)
generated_dataset, original_dataset = generator.generate(
    prompt_template=prompt,
    fewshot_dataset=fewshot_dataset,
    fewshot_examples_per_class=3,  # Take 3 fewshot examples (for token-level class we do not sample per class)
    unlabeled_dataset=unlabeled_dataset,
    max_prompt_calls=10,
    return_unlabeled_dataset=True,
)

generated_dataset = convert_spans_to_token_labels(
    dataset=generated_dataset,
    token_column="tokens",
    label_column="ner_tags",
    id2label=expanded_label_mapping
)

original_dataset = convert_spans_to_token_labels(
    dataset=original_dataset,
    token_column="tokens",
    label_column="ner_tags",
    id2label=expanded_label_mapping
)
```
