Data Folding
Beyond processing text contents, retrieving information, and annotating inner objects in texts,
Machine Learning models also require managing data subsets.
In other words, we need rules that describe how a series of documents is grouped into subsets.
For example, we may declare subsets for training, testing, validation, etc.
This post is an AREkit tutorial devoted to Folding as a common type,
and to the related concept of how data separation can be described
and passed to other pipelines that require it.
To describe the document separation format, AREkit-0.22.1 provides the BaseDataFolding type,
which in short can be described by the following snippet:
class BaseDataFolding(object):
    # ... other contents
    def fold_doc_ids_set(self):
        raise NotImplementedError()
According to the implementation above, declaring your own folding requires implementing fold_doc_ids_set.
This implementation assumes that for each data type in a given set, such as Train, Test, Dev, and many others,
we declare the set of related documents.
One of the common types of folding is the predefined, or fixed, one. It relies on a predefined separation of document indices between predefined data types. This folding type can be initialized as follows:
parts = {
    DataType.Train: [0, 1, 2, 3],
    DataType.Test: [4, 5, 6, 7]
}
fixed_folding = FixedFolding.from_parts(parts)
The absence of any folding can be declared as follows:
no_folding = NoFolding(doc_ids=[10, 15, 20], supported_data_type=DataType.Dev)
We also support combining different foldings via the UnitedFolding type.
For each data type supported by at least one of the given foldings, it gathers the related sets of
documents and unites them across all provided foldings.
The UnitedFolding type can be initialized as follows:
united_folding = UnitedFolding([fixed_folding, no_folding])
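The uniting rule described above can be sketched on plain dictionaries. This is an illustration only, not the UnitedFolding implementation: the `unite_foldings` helper is hypothetical, plain strings stand in for DataType members, and sorting the resulting ids is a choice made here for readability.

```python
def unite_foldings(folded_parts):
    """Merge several {data_type: doc_ids} mappings into one.

    For every data type present in at least one mapping, the result holds
    the sorted union of the document ids found across all mappings.
    """
    united = {}
    for parts in folded_parts:
        for data_type, doc_ids in parts.items():
            united.setdefault(data_type, set()).update(doc_ids)
    return {data_type: sorted(ids) for data_type, ids in united.items()}


# Plain strings stand in for DataType members (assumption for this sketch).
fixed_parts = {"train": [0, 1, 2, 3], "test": [4, 5, 6, 7]}
no_folding_parts = {"dev": [10, 15, 20]}
extra_parts = {"test": [7, 8]}

united = unite_foldings([fixed_parts, no_folding_parts, extra_parts])
```

Here the "test" subset is contributed by two inputs, so overlapping ids are kept only once, while "dev" comes from a single input and is passed through unchanged.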
Another folding supported in 0.22.1 is the so-called k-fold cross-validation one.
This folding distributes the whole set of documents among k parts.
The algorithm that performs this distribution is implemented by the so-called Splitters.
AREkit provides two types of splitters out of the box.
The first one (the simple version) performs a random separation driven by a given seed value.
A splitter of this type can be initialized as follows:
splitter_simple = SimpleCrossValidationSplitter(shuffle=True, seed=1)
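To illustrate what a seeded random separation means, here is a minimal sketch that shuffles document ids with a fixed seed and deals them into k parts. The `split_into_parts` function is hypothetical and is not taken from AREkit; the round-robin dealing is an assumption made to keep part sizes close to each other.

```python
import random


def split_into_parts(doc_ids, k, shuffle=True, seed=1):
    """Distribute document ids among k parts, optionally after a seeded shuffle."""
    ids = list(doc_ids)
    if shuffle:
        # A dedicated Random instance keeps the result reproducible
        # without touching the global random state.
        random.Random(seed).shuffle(ids)
    # Round-robin slicing keeps part sizes within one document of each other.
    return [ids[i::k] for i in range(k)]


parts_a = split_into_parts(range(10), k=3, seed=1)
parts_b = split_into_parts(range(10), k=3, seed=1)
```

Calling the function twice with the same seed yields the same parts, which is what makes experiments based on such a split repeatable.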
The other type is a statistical one: it relies on certain measurements, i.e. statistics calculated for each document, and uses them to produce a balanced distribution. By default we provide sentence-based statistics generation, which counts the number of sentences in a given document.
UPD 25 February 2023: The file-based limitation has been removed in later versions and replaced with doc_reader_func.
NOTE: in the present AREkit version (0.22.1), these statistics are provided via a file (stat.txt in the snippet below).
As for the doc_ops parameter, it relates to the DocumentOperation type, which we cover in the wiki page
Craft your text-opinion annotation pipeline!.
doc_ops = FooDocumentOperations()
splitter_statistical = StatBasedCrossValidationSplitter(
    docs_stat=SentenceBasedDocumentStatGenerator(lambda doc_id: doc_ops.get_doc(doc_id)),
    docs_stat_filepath_func=lambda: "data/stat.txt")
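The idea of a statistics-driven balanced distribution can be sketched with a simple greedy rule: take documents from the largest to the smallest and always place the next one into the part with the fewest sentences so far. This is an assumption about how such balancing might work, not the algorithm used by StatBasedCrossValidationSplitter; the `split_by_stat` function and the sentence counts are hypothetical.

```python
import heapq


def split_by_stat(doc_stats, k):
    """Greedy balancing: largest documents first, each to the lightest part.

    doc_stats maps a document id to its statistic (here: sentence count).
    """
    # Heap entries are (total sentences assigned so far, part index).
    heap = [(0, part_index) for part_index in range(k)]
    heapq.heapify(heap)
    parts = [[] for _ in range(k)]
    # Ties on the count are broken by document id to keep the order stable.
    ordered = sorted(doc_stats.items(), key=lambda kv: (-kv[1], kv[0]))
    for doc_id, count in ordered:
        total, part_index = heapq.heappop(heap)
        parts[part_index].append(doc_id)
        heapq.heappush(heap, (total + count, part_index))
    return parts


# Hypothetical sentence counts per document id.
sentence_counts = {0: 40, 1: 10, 2: 30, 3: 20, 4: 25, 5: 5}
parts = split_by_stat(sentence_counts, k=2)
totals = [sum(sentence_counts[d] for d in part) for part in parts]
```

With these counts, both parts end up with the same total number of sentences even though the documents differ a lot in length, which a purely random split does not guarantee.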
Once a splitter of either type is declared, it is possible to initialize a cross-validation-based folding. AREkit provides a two-class folding type, which can be used as follows:
cv_folding = TwoClassCVFolding(supported_data_types=[DataType.Train, DataType.Test],
                               doc_ids_to_fold=list(range(10)),
                               cv_count=3,
                               splitter=splitter_simple)
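For intuition, a two-class k-fold scheme is commonly understood as follows: on every iteration one part plays the role of the test subset, and the remaining parts together form the training subset. The sketch below follows that common scheme; whether TwoClassCVFolding iterates in exactly this way is an assumption, and `iter_two_class_cv` is a hypothetical helper with plain strings in place of DataType members.

```python
def iter_two_class_cv(parts):
    """Yield one {"train": ..., "test": ...} mapping per cross-validation iteration."""
    for test_index, test_ids in enumerate(parts):
        # Every part except the current one contributes to training.
        train_ids = [doc_id
                     for index, part in enumerate(parts) if index != test_index
                     for doc_id in part]
        yield {"train": train_ids, "test": list(test_ids)}


# Three parts over ten documents, matching cv_count=3 and range(10) above.
parts = [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
folds = list(iter_two_class_cv(parts))
```

Every document appears in the test subset exactly once across the iterations, so the evaluation covers the whole collection.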
© Nicolay Rusnachenko 2016-Present. Released under the MIT license.