
Data Folding

Nicolay Rusnachenko edited this page Feb 25, 2023 · 12 revisions

Besides processing text contents, retrieving information, and annotating inner objects in texts, Machine Learning models also require managing data subsets. In other words, we need rules that describe how a series of documents is grouped into subsets, for example Training, Testing, and Validation. This post is an AREkit tutorial devoted to Folding: a common type, and the related concept, for describing how data is separated and for passing that description into the other pipelines that require it.

Tutorial code

To describe the document separation format, AREkit-0.22.1 provides the BaseDataFolding type, which can be summarized by the following snippet:

class BaseDataFolding(object):
    # ... other contents
    def fold_doc_ids_set(self):
        raise NotImplementedError()

According to the snippet above, declaring your own folding requires implementing fold_doc_ids_set. The implementation is expected to declare, for each data type in a given set (Train, Test, Dev, and so on), the set of related documents.
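As a minimal sketch of what such an implementation might look like, the snippet below defines a custom folding on top of a stand-in base class. Note the assumptions: `DataType` and `BaseDataFolding` here are simplified local stand-ins rather than the AREkit classes, `EvenOddFolding` is a hypothetical name, and the return format (a dict that maps a data type to a set of document ids) is assumed for illustration.

```python
from enum import Enum


class DataType(Enum):
    # Local stand-in for AREkit's DataType (assumption for this sketch).
    Train = 1
    Test = 2


class BaseDataFolding(object):
    # Simplified stand-in that mirrors the snippet above.
    def fold_doc_ids_set(self):
        raise NotImplementedError()


class EvenOddFolding(BaseDataFolding):
    """Hypothetical folding: even doc ids go to Train, odd ones to Test."""

    def __init__(self, doc_ids):
        self._doc_ids = list(doc_ids)

    def fold_doc_ids_set(self):
        # Assumed return format: data type -> set of document ids.
        return {
            DataType.Train: {d for d in self._doc_ids if d % 2 == 0},
            DataType.Test: {d for d in self._doc_ids if d % 2 == 1},
        }


folding = EvenOddFolding(range(6))
folded = folding.fold_doc_ids_set()
```

Any other rule for assigning documents to data types could be placed inside fold_doc_ids_set in the same way.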

Folding Types

Fixed

One common type of folding is the predefined, or fixed, one. Here the document indices are split between the data types in advance. This folding type can be initialized as follows:

parts = {
    DataType.Train: [0, 1, 2, 3],
    DataType.Test: [4, 5, 6, 7]
}
fixed_folding = FixedFolding.from_parts(parts)

No Folding

The absence of any folding, where all listed documents belong to a single data type, can be declared as follows:

no_folding = NoFolding(doc_ids=[10, 15, 20], supported_data_type=DataType.Dev)

United Folding

We can also combine different foldings by using the UnitedFolding type. For each data type supported by at least one of the provided foldings, it gathers the related documents and unites them across all the provided foldings. UnitedFolding can be initialized as follows:

united_folding = UnitedFolding([fixed_folding, no_folding])
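The uniting behaviour described above can be sketched without AREkit as a merge of per-data-type document sets. This is only an illustration of the idea: `unite_foldings` is a hypothetical helper, plain strings stand in for DataType values, and the actual UnitedFolding implementation may differ.

```python
from collections import defaultdict


def unite_foldings(folded_parts):
    """Merge several {data_type: doc_ids} mappings into one (hypothetical helper)."""
    united = defaultdict(set)
    for part in folded_parts:
        for data_type, doc_ids in part.items():
            # Documents of the same data type are united across all foldings.
            united[data_type].update(doc_ids)
    return dict(united)


# Plain strings stand in for DataType values in this sketch.
fixed_part = {"train": [0, 1, 2, 3], "test": [4, 5, 6, 7]}
no_fold_part = {"dev": [10, 15, 20]}
united = unite_foldings([fixed_part, no_fold_part])
```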

CV-based Folding

Another folding supported in 0.22.1 is the so-called k-fold cross-validation one. This folding distributes the whole set of documents among k parts. The algorithm that performs this distribution is provided by the so-called Splitters.

AREkit provides two types of splitters out of the box. The first one (the simple version) performs a random separation governed by a given seed value. A splitter of this type can be initialized as follows:

splitter_simple = SimpleCrossValidationSplitter(shuffle=True, seed=1)
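To illustrate what a seeded random separation into k parts means, here is a minimal standalone sketch. The function `split_into_parts` is hypothetical and is not the SimpleCrossValidationSplitter implementation; it only shows that a fixed seed makes the shuffled split reproducible.

```python
import random


def split_into_parts(doc_ids, k, shuffle=True, seed=1):
    """Hypothetical helper: distribute doc ids among k parts."""
    ids = list(doc_ids)
    if shuffle:
        # A dedicated Random instance keeps the result reproducible per seed.
        random.Random(seed).shuffle(ids)
    # Strided slicing keeps part sizes within one document of each other.
    return [ids[i::k] for i in range(k)]


parts = split_into_parts(range(10), k=3)
```

Calling the function again with the same seed yields the same parts, which is what makes experiments over such a split repeatable.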

The other type is a statistical one: it relies on certain measurements, i.e. statistics calculated for each document, and uses them to produce a balanced distribution. By default we provide sentence-based statistics generation, which counts the number of sentences in a given document.

UPD 25 February 2023: The limitation based on files has been removed in later versions, and replaced with doc_reader_func.

NOTE: in the AREkit version covered by this tutorial, these statistics are provided via a file (stat.txt in the snippet below).

As for doc_ops in the snippet below, it relates to the DocumentOperation type, which we cover in the wiki page "Craft your text-opinion annotation pipeline!".

doc_ops = FooDocumentOperations()
splitter_statistical = StatBasedCrossValidationSplitter(
    docs_stat=SentenceBasedDocumentStatGenerator(lambda doc_id: doc_ops.get_doc(doc_id)),
    docs_stat_filepath_func=lambda: "data/stat.txt")
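
One possible way to use per-document statistics for a balanced distribution is a greedy assignment, sketched below. This is an assumption made for illustration, not the algorithm of StatBasedCrossValidationSplitter: `split_by_stat` is a hypothetical helper, and the sentence counts are made-up example values.

```python
def split_by_stat(doc_stat, k):
    """Hypothetical helper: doc_stat maps doc_id -> sentence count."""
    parts = [[] for _ in range(k)]
    totals = [0] * k
    # Place the largest documents first, each into the currently lightest part.
    by_size = sorted(doc_stat.items(), key=lambda kv: kv[1], reverse=True)
    for doc_id, count in by_size:
        lightest = totals.index(min(totals))
        parts[lightest].append(doc_id)
        totals[lightest] += count
    return parts, totals


# Made-up sentence counts per document id.
doc_stat = {0: 10, 1: 8, 2: 7, 3: 5, 4: 4, 5: 2}
parts, totals = split_by_stat(doc_stat, k=2)
```

With these example counts the two parts end up with 19 and 17 sentences, whereas splitting the ids in order (0 to 2 and 3 to 5) would give 25 and 11.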

Once a splitter of either type is declared, it is possible to initialize the cross-validation-based folding. AREkit provides a two-class folding type, which can be used as follows:

cv_folding = TwoClassCVFolding(supported_data_types=[DataType.Train, DataType.Test],
                               doc_ids_to_fold=list(range(10)),
                               cv_count=3,
                               splitter=splitter_simple)
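
This page does not describe how the two data types are filled at each cross-validation iteration. Under the conventional k-fold scheme, which is assumed here, one part serves as Test and the remaining parts form Train. The sketch below illustrates that scheme with a hypothetical `iter_two_class_folds` helper; it is not the TwoClassCVFolding implementation.

```python
def iter_two_class_folds(parts):
    """Yield (train_ids, test_ids) pairs, one per cross-validation iteration."""
    for test_index, test_part in enumerate(parts):
        # Every part except the current one contributes to the training set.
        train_ids = [doc_id
                     for i, part in enumerate(parts) if i != test_index
                     for doc_id in part]
        yield train_ids, list(test_part)


# Three parts, matching cv_count=3 over ten documents.
parts = [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
folds = list(iter_two_class_folds(parts))
```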