remove datasets related documentation (#374)

* remove datasets related documentation (see ArneBinder/pie-datasets#45)
* remove code snippet, but reference pie-datasets
* add remark to train example
ArneBinder · Nov 11, 2023 · e2d7518
1 parent 89ad722
Showing 1 changed file with 5 additions and 103 deletions.
diff --git a/README.md b/README.md
@@ -455,6 +455,8 @@ num_epochs = 10
 batch_size = 32
 
 # Get the PIE dataset consisting of PIE Documents that will be used for training (and evaluation).
+# IMPORTANT: This requires pie-datasets >=0.3.0 to be installed! See here for further information:
+# https://github.com/ArneBinder/pie-datasets
 dataset = datasets.load_dataset(
     path="pie/conll2003",
 )
@@ -543,109 +545,9 @@ trainer.fit(model, train_dataloader, val_dataloader)
 
 ## 📚 Datasets
 
-We parse all datasets into a common format that can be loaded directly from the model hub via Huggingface datasets. The documents are cached in an arrow table and serialized / deserialized on the fly. Any changes or preprocessing applied to the documents will be cached as well.
-
-```python
-import datasets
-
-dataset = datasets.load_dataset("pie/conll2003")
-
-print(dataset["train"][0])
-# >>> CoNLL2003Document(text='EU rejects German call to boycott British lamb .', id='0', metadata={})
-
-dataset["train"][0].entities
-# >>> AnnotationLayer([LabeledSpan(start=0, end=2, label='ORG', score=1.0), LabeledSpan(start=11, end=17, label='MISC', score=1.0), LabeledSpan(start=34, end=41, label='MISC', score=1.0)])
-
-entity = dataset["train"][0].entities[1]
-
-print(f"[{entity.start}, {entity.end}] {entity}")
-# >>> [11, 17] German
-```
-
-<details>
-<summary><b>How to create your own Pytorch-IE dataset</b></summary>
-
-PyTorch-IE datasets are built on top of Huggingface datasets. For instance, consider the
-[conll2003 from the Huggingface Hub](https://huggingface.co/datasets/conll2003) and especially their respective
-[dataset loading script](https://huggingface.co/datasets/conll2003/blob/main/conll2003.py). To create a PyTorch-IE
-dataset from that, you have to implement:
-
-1. A Document class. This will be the type of individual dataset examples.
-
-```python
-@dataclass
-class CoNLL2003Document(TextDocument):
-    entities: AnnotationLayer[LabeledSpan] = annotation_field(target="text")
-```
-
-Here we derive from `TextDocument` that has a simple `text` string as base annotation target. The `CoNLL2003Document`
-adds one single annotation list called `entities` that consists of `LabeledSpan`s which reference the `text` field of
-the document. You can add further annotation types by adding `AnnotationLayer` fields that may also reference (i.e.
-`target`) other annotations as you like. See ['pytorch_ie.annotations`](src/pytorch_ie/annotations.py) for predefined
-annotation types.
-
-2. A dataset config. This is similar to
-   [creating a Huggingface dataset config](https://huggingface.co/docs/datasets/dataset_script#multiple-configurations).
-
-```python
-class CoNLL2003Config(datasets.BuilderConfig):
-    """BuilderConfig for CoNLL2003"""
-
-    def __init__(self, **kwargs):
-        """BuilderConfig for CoNLL2003.
-        Args:
-          **kwargs: keyword arguments forwarded to super.
-        """
-        super().__init__(**kwargs)
-```
-
-3. A dataset builder class. This should inherit from
-   [`pytorch_ie.data.builder.GeneratorBasedBuilder`](src/pytorch_ie/data/builder.py) which is a wrapper around the
-   [Huggingface dataset builder class](https://huggingface.co/docs/datasets/v2.4.0/en/package_reference/builder_classes#datasets.GeneratorBasedBuilder)
-   with some utility functionality to work with PyTorch-IE `Documents`. The key elements to implement are: `DOCUMENT_TYPE`,
-   `BASE_DATASET_PATH`, and `_generate_document`.
-
-```python
-class Conll2003(pytorch_ie.data.builder.GeneratorBasedBuilder):
-    # Specify the document type. This will be the class of individual dataset examples.
-    DOCUMENT_TYPE = CoNLL2003Document
-
-    # The Huggingface identifier that points to the base dataset. This may be any string that works
-    # as path with Huggingface `datasets.load_dataset`.
-    BASE_DATASET_PATH = "conll2003"
-
-    # The builder configs, see https://huggingface.co/docs/datasets/dataset_script for further information.
-    BUILDER_CONFIGS = [
-        CoNLL2003Config(
-            name="conll2003", version=datasets.Version("1.0.0"), description="CoNLL2003 dataset"
-        ),
-    ]
-
-    # [Optional] Define additional keyword arguments which will be passed to `_generate_document` below.
-    def _generate_document_kwargs(self, dataset):
-        return {"int_to_str": dataset.features["ner_tags"].feature.int2str}
-
-    # Define how a Pytorch-IE Document will be created from a Huggingface dataset example.
-    def _generate_document(self, example, int_to_str):
-        doc_id = example["id"]
-        tokens = example["tokens"]
-        ner_tags = [int_to_str(tag) for tag in example["ner_tags"]]
-
-        text, ner_spans = tokens_and_tags_to_text_and_labeled_spans(tokens=tokens, tags=ner_tags)
-
-        document = CoNLL2003Document(text=text, id=doc_id)
-
-        for span in sorted(ner_spans, key=lambda span: span.start):
-            document.entities.append(span)
-
-        return document
-```
-
-The full script can be found here: [dataset_builders/conll2003/conll2003.py](dataset_builders/conll2003/conll2003.py). Note, that to
-load the dataset with `datasets.load_dataset`, the script has to be located in a directory with the same name (as it
-is the case for standard Huggingface dataset loading scripts).
-
-</details>
+PyTorch-IE works quite well together with Huggingface datasets. Have a look at
+[pie-datasets](https://github.com/ArneBinder/pie-datasets) for helpful tooling and a collection of datasets
+that are already converted to the PIE format.
 
 <!-- github-only -->
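
Note on the removed builder example: `_generate_document` relied on `tokens_and_tags_to_text_and_labeled_spans` to turn CoNLL-style tokens and tags into a text string plus character-offset spans. For readers who no longer see that snippet in the README, here is a minimal standalone sketch of the idea. It is not the library's implementation: `Span` and `tokens_and_tags_to_text_and_spans` are hypothetical stand-ins, and the sketch assumes BIO-style tags and tokens joined by single spaces.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for a labeled span; offsets are character positions."""

    start: int
    end: int
    label: str


def tokens_and_tags_to_text_and_spans(
    tokens: Sequence[str], tags: Sequence[str]
) -> Tuple[str, List[Span]]:
    """Join tokens with single spaces and convert BIO tags to character-offset spans."""
    spans: List[Span] = []
    offset = 0
    open_span: Optional[Tuple[int, int, str]] = None  # entity currently being built
    for token, tag in zip(tokens, tags):
        start, end = offset, offset + len(token)
        prefix, _, label = tag.partition("-")
        if prefix == "I" and open_span is not None and open_span[2] == label:
            # Continue the current entity: only its end offset moves.
            open_span = (open_span[0], end, label)
        else:
            # Close any finished entity, then open a new one unless the tag is "O".
            if open_span is not None:
                spans.append(Span(*open_span))
            open_span = (start, end, label) if prefix in ("B", "I") else None
        offset = end + 1  # skip the joining space
    if open_span is not None:
        spans.append(Span(*open_span))
    return " ".join(tokens), spans


# The sentence shown in the removed README example, with assumed BIO tags.
tokens = "EU rejects German call to boycott British lamb .".split()
tags = ["B-ORG", "O", "B-MISC", "O", "O", "O", "B-MISC", "O", "O"]
text, spans = tokens_and_tags_to_text_and_spans(tokens, tags)
print(text[spans[1].start : spans[1].end])  # German
```

With these inputs the sketch yields the same offsets as the removed README output (`[11, 17]` for `German`). For the actual conversion utilities and ready-made datasets, see pie-datasets as referenced in the diff.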